# Multimodal Visual Understanding

**Qwen2.5 VL 3B Instruct GGUF** (unsloth) · Image-to-Text · English · 4,645 downloads · 4 likes
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring powerful visual understanding and multimodal processing capabilities.

**PE Lang G14 448** (facebook) · Apache-2.0 · Image Feature Extraction · 247 downloads · 11 likes
The Perception Encoder is a state-of-the-art image and video understanding encoder trained through vision-language training, with strong generalization capabilities.

**PE Lang L14 448** (facebook) · Apache-2.0 · Image Feature Extraction · 1,087 downloads · 6 likes
The Perception Encoder (PE) is an advanced image and video understanding encoder trained through vision-language learning, achieving state-of-the-art performance on various visual tasks.

**Space Model** (Alhdrawi) · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 58 downloads · 1 like
Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family, featuring powerful visual understanding and intelligent agent capabilities and supporting multimodal task processing.

**Qwen2.5 VL 7B Instruct GGUF** (Mungert) · Apache-2.0 · Image-to-Text · English · 17.10k downloads · 10 likes
Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image understanding and text generation tasks.

**Qwen2.5 VL Instruct 3B Geo** (kxxinDave) · Apache-2.0 · Image-to-Text · Transformers · English · 29 downloads · 2 likes
Qwen2.5-VL is the latest vision-language model in the Qwen family, focusing on enhanced visual understanding and agent capabilities.

**Mlabonne Gemma 3 4b It Abliterated GGUF** (bartowski) · Image-to-Text · 9,164 downloads · 8 likes
A quantized version of the mlabonne/gemma-3-4b-it-abliterated model, produced with llama.cpp imatrix quantization and suitable for image-text-to-text tasks.

**Toriigate V0.4 7B I1 GGUF** (mradermacher) · Apache-2.0 · Image-to-Text · English · 410 downloads · 1 like
A weighted/importance-matrix (imatrix) quantized version of the Minthy/ToriiGate-v0.4-7B model, offering multiple quantization options to suit different needs.

**Qwen2.5 VL 72B Instruct AWQ Fix** (Benasd) · Other · Image-to-Text · Transformers · English · 94 downloads · 1 like
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring powerful visual understanding and agent capabilities and supporting multi-format visual localization and structured output generation.

**Qwen2.5 VL 72B Instruct AWQ** (Benasd) · Other · Image-to-Text · Transformers · English · 173 downloads · 6 likes
Qwen2.5-VL is a multimodal large language model from the QwenLM team, featuring powerful visual understanding and intelligent agent capabilities and supporting image, video, and text inputs.

**Qwen2.5 VL 7B Instruct AWQ** (Benasd) · Apache-2.0 · Image-to-Text · Transformers · English · 226 downloads · 7 likes
Qwen2.5-VL is a multimodal vision-language model from Tongyi Qianwen (Qwen), featuring powerful image understanding and text generation capabilities.

**Minicpm O 2 6 Gguf** (openbmb) · Image-to-Text · 5,660 downloads · 101 likes
MiniCPM-o 2.6 is a multimodal model supporting vision and language tasks, packaged in GGUF format for use with llama.cpp.
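
GGUF checkpoints such as this one are run with llama.cpp rather than Transformers. Here is a minimal llama-cpp-python sketch; it assumes the `MiniCPMv26ChatHandler` that llama-cpp-python documents for the MiniCPM-V 2.6 family also applies to this release, and the GGUF file names and image path are placeholders.

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI, the format llama-cpp-python accepts."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# Placeholder file names: download the main GGUF and the vision projector (mmproj) from the repo.
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="Model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # image tokens are long; leave room in the context window
)

response = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": image_to_data_uri("photo.png")}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(response["choices"][0]["message"]["content"])
```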

**Razorback 12B V0.2** (nintwentydo) · Other · Image-to-Text · Transformers · Multilingual · 17 downloads · 3 likes
Razorback 12B v0.2 is a multimodal model that combines the strengths of Pixtral 12B and UnslopNemo v3, featuring visual understanding and language processing capabilities.

**Llama 3.2 90B Vision Instruct Unsloth Bnb 4bit** (unsloth) · Image-to-Text · Transformers · English · 58 downloads · 2 likes
A 90B-parameter multimodal large language model from the Meta Llama 3.2 series that supports visual instruction understanding, optimized with Unsloth dynamic 4-bit quantization.

**Minicpm V 2 6 Rk3588 1.1.4** (c01zaut) · Image-to-Text · Transformers · Other · 31 downloads · 3 likes
MiniCPM-V 2.6 is a GPT-4V-level multimodal large language model that supports single-image, multi-image, and video understanding, optimized here for the RK3588 NPU.

**Cambrian 8b** (nyu-visionx) · Apache-2.0 · Image-to-Text · Transformers · 565 downloads · 63 likes
Cambrian is an open-source multimodal LLM designed with a vision-centric approach.

**Phi 3 Vision 128k Instruct** (microsoft) · MIT · Image-to-Text · Transformers · Other · 25.19k downloads · 958 likes
Phi-3-Vision-128K-Instruct is a lightweight, cutting-edge open multimodal model supporting a 128K-token context length, focused on high-quality reasoning over text and visual inputs.
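
Phi-3-Vision ships its modeling code inside the repository, so loading it requires `trust_remote_code`. A minimal sketch following the pattern on the model card; the image path is a placeholder.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
# trust_remote_code is required: the modeling code lives in the repo, not in transformers.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Phi-3-Vision numbers its image placeholders: <|image_1|>, <|image_2|>, ...
messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("chart.png")  # placeholder path
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs, max_new_tokens=256, eos_token_id=processor.tokenizer.eos_token_id
)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```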

**Llava Phi 3 Mini 4k Instruct** (MBZUAI) · MIT · Image-to-Text · Transformers · 550 downloads · 22 likes
A vision-language model that combines the Phi-3-mini-3.8B large language model with LLaVA v1.5, providing advanced vision-language understanding capabilities.

**Owlv2 Base Patch16** (Xenova) · Object Detection · Transformers · 17 downloads · 0 likes
OWLv2 is a vision-language pre-trained model for zero-shot, text-conditioned object detection and localization.
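
The Xenova repo hosts an ONNX conversion of OWLv2 for Transformers.js; the upstream PyTorch checkpoint is google/owlv2-base-patch16. A minimal zero-shot detection sketch against that checkpoint, with placeholder image path and text queries:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

# Upstream PyTorch weights; the Xenova repo is the ONNX port of the same model.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")

image = Image.open("street.jpg")  # placeholder path
texts = [["a photo of a car", "a photo of a traffic light"]]  # free-text queries
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes back to the original image size: (height, width).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)
for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```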